Automatic Structuring of Written Texts

Authors

  • Marek Veber
  • Ales Horák
  • Rostislav Julinek
  • Pavel Smrz
Abstract

This paper deals with automatic structuring and sentence boundary labelling in natural language texts. We describe the implemented structure tagging algorithm and the heuristic rules that are used for automatic or semiautomatic labelling. Within each detected sentence, the algorithm performs a decomposition into clauses, and it then marks the parts of the text that do not form a sentence, i.e. headings, signatures, tables and other structured data. We also pay attention to the processing of matched symbols in the text, especially to the analysis of direct speech notation.
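
To make the idea of heuristic sentence boundary labelling with matched symbols more concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not specify. The abbreviation list, the set of symbol pairs, and the function name `split_sentences` are all assumptions chosen for illustration. The sketch treats `.`, `!` or `?` as a boundary only when no matched symbol is still open, the preceding token is not a listed abbreviation, and the next text starts with an uppercase letter or an opening symbol.

```python
# Hypothetical abbreviation list; the paper does not give one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr.", "mrs.", "prof."}

# Hypothetical matched-symbol pairs (opening -> closing).
# Only typographic quotes are handled, since ASCII quotes do not
# distinguish opening from closing.
PAIRS = {"(": ")", "[": "]", "\u201c": "\u201d"}


def split_sentences(text):
    """Split text into sentences with simple heuristic rules."""
    sentences = []
    stack = []   # closing symbols that are still expected
    start = 0    # index where the current sentence begins
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append(PAIRS[ch])
            continue
        closes = bool(stack) and ch == stack[-1]
        if closes:
            stack.pop()
        if stack:
            continue  # still inside direct speech or brackets
        # A candidate is a terminal mark, or a closing symbol
        # that directly follows a terminal mark.
        ends = ch in ".!?" or (closes and i > 0 and text[i - 1] in ".!?")
        if not ends:
            continue
        rest = text[i + 1:]
        nxt = rest.lstrip()
        at_boundary = nxt == "" or (
            rest[0].isspace() and (nxt[0].isupper() or nxt[0] in PAIRS)
        )
        if not at_boundary:
            continue
        tokens = text[start:i + 1].split()
        if tokens and tokens[-1].lower() in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # unterminated remainder, e.g. a heading
    return sentences
```

For example, `split_sentences("Dr. Smith arrived. He sat down.")` keeps `Dr.` attached to the first sentence, and a period inside typographic quotation marks does not end the enclosing sentence. Clause decomposition and the labelling of headings, signatures and tables, which the abstract also mentions, are outside the scope of this sketch.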

Similar resources

Automatic identification of language varieties: The case of Portuguese

Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...

Learner Engagement with Structuring and Problematizing in Scaffolded Writing Tasks: A Mixed-Methods Multiple Case Study

The present study set out to delineate to what extent five intermediate learners engaged in structuring and problematizing scaffolding in two writing tasks. The study aimed at illuminating how the participants engaged with structuring and problematizing scaffolds cognitively, behaviorally, and affectively. Learners’ written essays, think-aloud protocols, and interviews shaped the data sources w...

Tools for Terminology Processing

Automatic terminology processing appeared 10 years ago when electronic corpora became widely available. Such processing may be statistically or linguistically based and produces terminology resources that can be used in a number of applications: indexing, information retrieval, technology watch, etc. We present the tools that have been developed in the IRIN Institute. They all take as input te...

Hwæt! LOL! – common formulaic functions in Beowulf and blogs

We consider the functions that formulae perform in two types of written texts which maintain close links to oral forms: Old English epic poetry and blogs. Five orality-related functions of formulae are identified in both datasets: discourse-structuring, filler, epithetic, gnomic, and tonic. A sixth type of formulaic function, the acronymic, necessarily tied to the written medium, is also found in...

A Database of Freely Written Texts of German School Students for the Purpose of Automatic Spelling Error Classification

The spelling competence of school students is best measured on freely written texts, instead of pre-determined, dictated texts. Since the analysis of the error categories in these kinds of texts is very labor intensive and costly, we are working on an automatic system to perform this task. The modules of the system are derived from techniques from the area of natural language processing, and ...

Publication date: 1999